Data Carpentry: Using R to Analyze and Visualize Data (Part 1)

Evan Cortens, PhD

October 22, 2017

Why use R?

Why use R?

  • Free
  • Open-source
  • A “statistical” programming language
    • Generally accepted by academic statisticians
    • Thoroughly grounded in statistical theory
  • A great community
    • Help and support
    • Innovation and development
  • R is a stable, well-supported language
    • Major corporate backing, e.g., Microsoft

Growing fast!

Growing fast!

https://stackoverflow.blog/2017/10/10/impressive-growth-r/

Some history

  • Based on the S programming language
    • Dates from the late 1970s, substantially revised in the late 1980s
    • (there is still a commercial version, S-PLUS)
  • R was developed at the University of Auckland
    • Ross Ihaka and Robert Gentleman
    • First developed in 1993
    • Stable public release in 2000
  • RStudio
    • IDE, generally recognized as high quality, even outside R
    • Development began in 2008, first public release in February 2011
      • Version 1.0 in Fall 2016
    • There have been other IDEs for R in the past, but since RStudio came on the scene, most are no longer seriously maintained

Some history

Why is this relevant?

  • You’ll likely notice some quirks in R that differ from other programming languages you make have worked in–the unique history often explains why.
  • Explains R’s position as a “statistical programming language”:
    • Think stats/data first, programming second.

Today’s approach/Learning outcomes

Today’s approach/Learning outcomes

In learning R today, we’ll take an approach, used in R for Data Science (Wickham & Grolemund, 2016), diagrammed like this:

One of the strengths of R: every step of this workflow can be done using it!

Import

  • Loading the data!
  • For IR-type tasks, this will generally be from data files or directly from databases.
  • Other possibilities include web APIs, or even web scraping

Tidy

  • “Tidy data” (Wickham, 2014)
    • Each column is a variable, each row is an observation.
  • Doing this kind of work up front, right after loading, lets you focus on the modelling/analysis/visualization problem at hand, rather than having to rework your data at each stage of your analysis.

Transform/Visualize/Model

  • Repeat these three steps as necessary:

Transform

  • Together with tidying, sometimes called data wrangling
    • Filtering (selecting specific observations)
    • Mutating (creating new variables)
    • Summarizing (means, counts, etc)

Visualise

  • For both exploratory purposes and production/communication

Model

  • Complementary to visualisation
  • For the “precise exploration” of “specific questions”

Communicate

  • The final output, whether it’s just for yourself, your colleagues, or a wider audience
  • Increasingly, there’s a trend toward “reproducible research” which integrates even the communciation step (final paper, report, etc) into the code/analysis.

Housekeeping

Installing R and RStudio

Basic R

  • An interpreted language (like Python)
  • Code can be run as:
    • scripts (programatically/non-interactively)
    • from the prompt (interactively)
  • R uses an REPL (Read-Evaluate-Print Loop) just like Python
    • Using R as a calculator (demonstration)

Outline

Outline

  1. Visualisation
  2. Transformation
  3. Importing and ‘wrangling’
  4. Hands-on portion using what we’ve learned

Visualisation

Let’s load our package

library(tidyverse)

A package only needs to be installed once (per major version, e.g. 3.3.x to 3.4.x), but must be loaded every time.

The ‘mpg’ data set

Data on the fuel efficiency of 38 models of cars between 1999 and 2008 from the US EPA:

mpg
## # A tibble: 234 x 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
##  1         audi         a4   1.8  1999     4   auto(l5)     f    18    29
##  2         audi         a4   1.8  1999     4 manual(m5)     f    21    29
##  3         audi         a4   2.0  2008     4 manual(m6)     f    20    31
##  4         audi         a4   2.0  2008     4   auto(av)     f    21    30
##  5         audi         a4   2.8  1999     6   auto(l5)     f    16    26
##  6         audi         a4   2.8  1999     6 manual(m5)     f    18    26
##  7         audi         a4   3.1  2008     6   auto(av)     f    18    27
##  8         audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
##  9         audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>

The Grammar of Graphics

  • Layers
  • Inheritance
  • Mapping (aes)

Our First Plot

  • A car’s highway mileage (mpg) vs its engine size (displacement in litres).
  • What might the relationship be?

Our First Plot

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

  • As engine size increases, fuel efficiency decreases, roughly.

Exercises

  • The first scatterplot was highway mileage vs displacement: how can we make it city mileage vs displacement?
  • Make a scatterplot of hwy vs cyl.
  • What about a scatterplot of class vs drv?

Hwy vs cyl

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = hwy, y = cyl))

Class vs drv

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = class, y = drv))

What’s going on here? Is this useful? How might we make it more useful?

Additional aesthetics

What about those outliers? Use the color aesthetic.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

‘Unmapped’ aesthetics

What’s happening here?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

What’s gone wrong?

What’s happened here? What colour are the points? Why?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

Exercises

  • mpg variable types: which are categorical, which are continuous?
  • ?mpg
  • Map a continuous variable to colour, size, and shape: how are these aesthetics different for categorical vs continuous?
  • What happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)?

Facets: One Variable

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

Facets: Two Variables

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

Other “geoms” (geometries)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

Smooth

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess'

Smooth aesthetics

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
## `geom_smooth()` using method = 'loess'

Which aesthetics do geoms have?

?geom_smooth

Clearer?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv, linetype = drv))
## `geom_smooth()` using method = 'loess'

Reducing Duplication

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() +
  geom_smooth(mapping = aes(linetype = drv))
## `geom_smooth()` using method = 'loess'

Exercise: Recreate:

## `geom_smooth()` using method = 'loess'

Exercise: Recreate:

## `geom_smooth()` using method = 'loess'

Exercise: Recreate:

## `geom_smooth()` using method = 'loess'

One last visualization

A common scenario:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

Coding basics

R as a calculator

1 / 200 * 30
## [1] 0.15
(59 + 73 + 2) / 3
## [1] 44.66667
sin(pi / 2)
## [1] 1

Assignments

x <- 3 * 4

Calling functions

Calling R functions, in general:

function_name(arg1 = val1, arg2 = val2, ...)

An example function, seq:

seq(1, 10)
##  [1]  1  2  3  4  5  6  7  8  9 10

Data wrangling with dplyr and tidyr (tidyverse)

Create RStudio project

  1. File | New Project…
  2. New Directory
  3. Empty Project
  4. Under “Directory name:” put in “CIRPA 2017”, under “Create project as a subdirectory of:”, pick somewhere you’ll remember.
  5. Click “Create Project”, and RStudio will refresh, with your new project loaded.

Download the data

We’ll look at CANSIM Tables 0477-0058 and 0477-0059 which cover university revenues and expenditures, respectively.

To save time, I’ve already downloaded the data from CANSIM. You can get it here:

http://evancortens.com/cirpa/

Load the data

We’ll use read_csv() from readr, which I prefer to the base R function, as it has some more sensible defaults, and it’s faster.

# http://www20.statcan.gc.ca/tables-tableaux/cansim/csv/04770058-eng.zip
# make sure the zip file is in the same directory as your RStudio project

revenue_raw <- read_csv('04770058-eng.zip') 
## Parsed with column specification:
## cols(
##   Ref_Date = col_character(),
##   GEO = col_character(),
##   SCHOOL = col_character(),
##   REVENUE = col_character(),
##   FUND = col_character(),
##   Vector = col_character(),
##   Coordinate = col_character(),
##   Value = col_double()
## )

The read_csv() function tells us that it guessed the types of the various columns. In this situation, the default guesses are fine, but of course we can force it to treat columns certain ways if we wish.

What does the data look like?

revenue_raw
## # A tibble: 122,240 x 8
##     Ref_Date    GEO                          SCHOOL        REVENUE
##        <chr>  <chr>                           <chr>          <chr>
##  1 2000/2001 Canada Total universities and colleges Total revenues
##  2 2001/2002 Canada Total universities and colleges Total revenues
##  3 2002/2003 Canada Total universities and colleges Total revenues
##  4 2003/2004 Canada Total universities and colleges Total revenues
##  5 2004/2005 Canada Total universities and colleges Total revenues
##  6 2005/2006 Canada Total universities and colleges Total revenues
##  7 2006/2007 Canada Total universities and colleges Total revenues
##  8 2007/2008 Canada Total universities and colleges Total revenues
##  9 2008/2009 Canada Total universities and colleges Total revenues
## 10 2009/2010 Canada Total universities and colleges Total revenues
## # ... with 122,230 more rows, and 4 more variables: FUND <chr>,
## #   Vector <chr>, Coordinate <chr>, Value <dbl>

We have 8 columns and 122,240 rows of data. read_csv() brings the data in as a tibble, which is just an R “data frame”, but with some handy defaults, some of which we’re seeing here. For instance, it gives us the size of the data frame in rows and columns, the types of the columns (e.g., “<chr>” for character) and only prints the first 10 rows, instead of overwhelming us with all of the data.

What does the data look like?

View(revenue_raw) # in RStudio

Or click the icon in the Environment tab.

What does the data look like?

head(revenue_raw, 1) # show just the first row
  • Ref_Date: fiscal year
  • GEO: province or whole country
  • SCHOOL: Direct participation in CAUBO survey, or via Statistics Canada, or combined (we’ll just look at this)
  • REVENUE: Type of income (e.g., tuition, provincial government, SSHRC, etc)
  • FUND: Classification of income “in accordance with activities or objectives as specified by donors, … regulations, restrictions, or limitations” (We’re just going to look at “Total funds (x 1,000)”, but the distinctions are important for full analysis)
  • Vector: a CANSIM unique identifier
  • Coordinate: ditto
  • Value: the actual dollar value (x 1,000, per the FUND heading)

What makes data “tidy”?

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

(See Wickham, 2014 and R for Data Science, chapter 12)

Is the CANSIM data tidy?

  • Nope! So let’s tidy it.

dplyr basics

  • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).
  • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).
  • Collapse many values down to a single summary (summarise()).

  • All can be used in conjunction with group_by()

dplyr cheat sheet

https://www.rstudio.com/resources/cheatsheets/

(or Help | Cheatsheets in RStudio!)

filter() the rows we want

filter(revenue_raw, 
       SCHOOL == 'Total universities and colleges', 
       FUND == 'Total funds (x 1,000)')
## # A tibble: 5,632 x 8
##     Ref_Date    GEO                          SCHOOL        REVENUE
##        <chr>  <chr>                           <chr>          <chr>
##  1 2000/2001 Canada Total universities and colleges Total revenues
##  2 2001/2002 Canada Total universities and colleges Total revenues
##  3 2002/2003 Canada Total universities and colleges Total revenues
##  4 2003/2004 Canada Total universities and colleges Total revenues
##  5 2004/2005 Canada Total universities and colleges Total revenues
##  6 2005/2006 Canada Total universities and colleges Total revenues
##  7 2006/2007 Canada Total universities and colleges Total revenues
##  8 2007/2008 Canada Total universities and colleges Total revenues
##  9 2008/2009 Canada Total universities and colleges Total revenues
## 10 2009/2010 Canada Total universities and colleges Total revenues
## # ... with 5,622 more rows, and 4 more variables: FUND <chr>,
## #   Vector <chr>, Coordinate <chr>, Value <dbl>

filter() help tells us that everything after the first argument (i.e., the “…”) are: “Logical predicates defined in terms of the variables in .data. Multiple conditions are combined with &.”

Comparison operators

>   # greater than
>=  # greater than or equal to
<   # less than
<=  # less than or equal to
!=  # not equal
==  # equal
  
%in% # not quite a comparison operator, but handy 
     # x %in% y is true if the values of vector x are present in vector y

Logical operators

Other operators we can use with filter():

Complete set of boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.

filter()

In other words, the previous command is equivalent to this:

filter(revenue_raw, 
       SCHOOL == 'Total universities and colleges' & 
         FUND == 'Total funds (x 1,000)')
## # A tibble: 5,632 x 8
##     Ref_Date    GEO                          SCHOOL        REVENUE
##        <chr>  <chr>                           <chr>          <chr>
##  1 2000/2001 Canada Total universities and colleges Total revenues
##  2 2001/2002 Canada Total universities and colleges Total revenues
##  3 2002/2003 Canada Total universities and colleges Total revenues
##  4 2003/2004 Canada Total universities and colleges Total revenues
##  5 2004/2005 Canada Total universities and colleges Total revenues
##  6 2005/2006 Canada Total universities and colleges Total revenues
##  7 2006/2007 Canada Total universities and colleges Total revenues
##  8 2007/2008 Canada Total universities and colleges Total revenues
##  9 2008/2009 Canada Total universities and colleges Total revenues
## 10 2009/2010 Canada Total universities and colleges Total revenues
## # ... with 5,622 more rows, and 4 more variables: FUND <chr>,
## #   Vector <chr>, Coordinate <chr>, Value <dbl>

I prefer this notation, as it’s more explicit.

filter()

But, one more thing: we need to assign the return value of the filter() function back to a variable:

revenue <- filter(revenue_raw, 
                  SCHOOL == 'Total universities and colleges' & 
                    FUND == 'Total funds (x 1,000)')

Shortcut for assignment operator: Alt-Hyphen in RStudio

Combining operations

A new variable for each step gets cumbersome, so dplyr provides an operator, the pipe (%>%) that combines operations:

revenue_long <- revenue_raw %>% 
  # only rows matching this
  filter(SCHOOL == 'Total universities and colleges' & 
                    FUND == 'Total funds (x 1,000)') %>% 
  # remove these columns
  select(-SCHOOL, -FUND, -Vector, -Coordinate) %>% 
  # fix up the date column
  mutate(Ref_Date = as.integer(stringr::str_sub(Ref_Date, 1, 4)))

x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) etc.

Shortcut for pipe operator: Ctrl-Shift-M in RStudio

head(revenue_long, 1) # looks good so far!
## # A tibble: 1 x 4
##   Ref_Date    GEO        REVENUE    Value
##      <int>  <chr>          <chr>    <dbl>
## 1     2000 Canada Total revenues 16224715

Select helper functions

  • There are a number of helper functions you can use within select():
    • starts_with("abc"): matches names that begin with “abc”.
    • ends_with("xyz"): matches names that end with “xyz”.
    • contains("ijk"): matches names that contain “ijk”.
    • matches("(.)\\1"): selects variables that match a regular expression.
    • num_range("x", 1:3): matches x1, x2 and x3.
    • one_of(vector): columns whose names are in said vector.

Tidy data with tidyr

The two main tidyr functions:

  • gather(data, key, value): Moves column names into a key column, with the values going into a single value column.
  • spread(data, key, value): Moves unique values of key column into column names, with values from the value column.

tidyr Cheatsheet

https://www.rstudio.com/resources/cheatsheets/ (Data Import Cheatsheet)

spread() CANSIM

revenue <- revenue_long %>% 
  spread(REVENUE, Value)

revenue
## # A tibble: 176 x 34
##    Ref_Date                       GEO `Canada Foundation for Innovation`
##  *    <int>                     <chr>                              <dbl>
##  1     2000                   Alberta                               8208
##  2     2000          British Columbia                               5189
##  3     2000                    Canada                             212418
##  4     2000                  Manitoba                               7170
##  5     2000             New Brunswick                               1745
##  6     2000 Newfoundland and Labrador                               2799
##  7     2000               Nova Scotia                               3234
##  8     2000                   Ontario                              91066
##  9     2000      Prince Edward Island                                355
## 10     2000                    Quebec                              75988
## # ... with 166 more rows, and 31 more variables: `Canada Research
## #   Chairs` <dbl>, `Canadian Institute of Health Research` <dbl>, `Credit
## #   courses tuiton` <dbl>, `Donations made by business enterprises` <dbl>,
## #   `Donations made by individuals` <dbl>, `Donations made by
## #   not-for-profit organizations` <dbl>, Endowment <dbl>, Federal <dbl>,
## #   Foreign <dbl>, `Grants made by business enterprises` <dbl>, `Grants
## #   made by Individuals` <dbl>, `Grants made by not-for-profit
## #   organizations` <dbl>, `Health Canada` <dbl>, Investments <dbl>,
## #   Miscellaneous <dbl>, Municipal <dbl>, `Natural Sciences and
## #   Engineering Research Council` <dbl>, `Non-credit tution` <dbl>,
## #   `Non-federal` <dbl>, `Other federal` <dbl>, `Other fees` <dbl>, `Other
## #   investments` <dbl>, `Other provinces` <dbl>, `Other revenue
## #   type` <dbl>, Provincial <dbl>, `Sale of service and products` <dbl>,
## #   `Social Sciences and Humanities Research Council` <dbl>, `Total
## #   donations` <dbl>, `Total grants` <dbl>, `Total revenues` <dbl>,
## #   `Tuition and other fees` <dbl>

Visualizing

revenue %>% 
  filter(GEO == 'Canada') %>% 
  # pass the result to ggplot() as the first argument
  ggplot(aes(Ref_Date, `Total revenues`)) +
  # now it switches to + to combine, which is ggplot's way
  geom_line()

Not the prettiest, but it works!

Visualizing: Labels

revenue %>% 
  filter(GEO == 'Canada') %>% 
  ggplot(aes(Ref_Date, `Total revenues`)) +
  geom_line() +
  labs(title = 'Total University and Degree-Granting College Revenue',
       x = 'Fiscal Year',
       y = 'Revenue ($ thousands)') +
  scale_y_continuous(labels = scales::comma)

Inline Calculation

revenue %>% 
  filter(GEO %in% c('Canada', 'Alberta', 'Ontario')) %>% 
  ggplot(aes(Ref_Date, `Tuition and other fees` / `Total revenues`, color = GEO)) +
  geom_line() +
  labs(title = 'Tuition and fees as a share of total revenue') +
  scale_y_continuous(labels = scales::percent)

Exercises

  1. Visualize another variable
  2. Set some appropriate labels and scales
  3. Graph a time subset (e.g., between 2005 and 2010)

Other dplyr “verbs”

Which province took in the most tuition revenue in 2010?

revenue %>%
  filter(GEO != 'Canada', Ref_Date == 2010) %>% 
  select(GEO, `Tuition and other fees`) %>% 
  arrange(desc(`Tuition and other fees`))
## # A tibble: 10 x 2
##                          GEO `Tuition and other fees`
##                        <chr>                    <dbl>
##  1                   Ontario                  3563011
##  2          British Columbia                  1009476
##  3                    Quebec                   825943
##  4                   Alberta                   711973
##  5               Nova Scotia                   305597
##  6                  Manitoba                   176165
##  7              Saskatchewan                   173600
##  8             New Brunswick                   146213
##  9 Newfoundland and Labrador                    61537
## 10      Prince Edward Island                    27878

Create a new variable

revenue %>% 
  select(Ref_Date, GEO, `Tuition and other fees`, `Total revenues`) %>% 
  mutate(tuition_share = `Tuition and other fees` / `Total revenues`)
## # A tibble: 176 x 5
##    Ref_Date                       GEO `Tuition and other fees`
##       <int>                     <chr>                    <dbl>
##  1     2000                   Alberta                   298464
##  2     2000          British Columbia                   321017
##  3     2000                    Canada                  3052960
##  4     2000                  Manitoba                    99932
##  5     2000             New Brunswick                    79153
##  6     2000 Newfoundland and Labrador                    49954
##  7     2000               Nova Scotia                   173980
##  8     2000                   Ontario                  1509442
##  9     2000      Prince Edward Island                    13778
## 10     2000                    Quebec                   407458
## # ... with 166 more rows, and 2 more variables: `Total revenues` <dbl>,
## #   tuition_share <dbl>

Keeps the same number of rows

Useful “mutations”

  • Arithmetic: +, -, *, /, ^
  • Modular arithmetic: %/% (integer division), %% (remainder)
  • Logs: log(), log2(), log10()
  • Offsets: lead(), lag()
  • Cumulative and rolling aggregates: cumsum(), cumprod(), cummin(), cummax(), cummean()
  • Logical comparisons: <, <=, >, >=, !=, ==
  • Ranking: min_rank(), row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()
  • ifelse()
  • recode()
  • case_when()

Summarize (aggregate)

revenue %>% 
  group_by(GEO) %>% 
  summarise(avg_endowment_revenue = mean(Endowment)) %>% 
  arrange(-avg_endowment_revenue)
## # A tibble: 11 x 2
##                          GEO avg_endowment_revenue
##                        <chr>                 <dbl>
##  1                    Canada           526070.7500
##  2                   Ontario           211474.5625
##  3                    Quebec            87322.9375
##  4                   Alberta            80069.0000
##  5          British Columbia            75751.6250
##  6               Nova Scotia            31147.7500
##  7              Saskatchewan            21255.5625
##  8             New Brunswick            13778.1875
##  9 Newfoundland and Labrador             2877.6875
## 10                  Manitoba             1646.5000
## 11      Prince Edward Island              746.9375

Useful summarising functions:

  • mean(x), median(x)
  • sd(x), IQR(x) (interquartile range), mad(x) (median absolute deviation)
  • min(x), max(x), quantile(x, 0.25)
  • first(x), nth(x, 2), last(x)
  • n(x), n_distinct(x)
  • Counts and proportions of logical values: sum(x > 10), mean(y == 0)
    • TRUE is converted to 1 and FALSE to 0

Exercise:

Compute the total tuition revenue of the three western-most provinces in 2007.

revenue %>% 
  filter(GEO %in% c('British Columbia', 'Alberta', 'Saskatchewan'), Ref_Date == 2007) %>% 
  summarise(sum(`Tuition and other fees`))
## # A tibble: 1 x 1
##   `sum(\`Tuition and other fees\`)`
##                               <dbl>
## 1                           1374509

Group by new variable

These three provinces compared to all other provinces and national average.

revenue %>% 
  filter(Ref_Date == 2007) %>% 
  mutate(category = case_when(
    GEO %in% c('British Columbia', 'Alberta', 'Saskatchewan') ~ '3 Western Provinces',
    GEO == 'Canada' ~ GEO,
    TRUE ~ 'All Other Provinces'
  )) %>% 
  group_by(category) %>% 
  summarise(sum(`Tuition and other fees`))
## # A tibble: 3 x 2
##              category `sum(\`Tuition and other fees\`)`
##                 <chr>                             <dbl>
## 1 3 Western Provinces                           1374509
## 2 All Other Provinces                           4079866
## 3              Canada                           5454375

Change relative first year

tuition_relative_change <- revenue %>% 
  arrange(Ref_Date, GEO) %>% 
  group_by(GEO) %>%
  mutate(rel_change = (`Tuition and other fees` - first(`Tuition and other fees`)) / first(`Tuition and other fees`)) %>% 
  select(Ref_Date, GEO, `Tuition and other fees`, rel_change)

ggplot(tuition_relative_change, aes(Ref_Date, rel_change, color = GEO)) +
  geom_line()

Logical sums

Count years with positive endowment income by province

revenue %>% 
  group_by(GEO) %>% 
  summarise(num_pos_endow = sum(Endowment > 0))
## # A tibble: 11 x 2
##                          GEO num_pos_endow
##                        <chr>         <int>
##  1                   Alberta            13
##  2          British Columbia            13
##  3                    Canada            13
##  4                  Manitoba            14
##  5             New Brunswick            13
##  6 Newfoundland and Labrador            13
##  7               Nova Scotia            14
##  8                   Ontario            12
##  9      Prince Edward Island            13
## 10                    Quebec            16
## 11              Saskatchewan            12

Exercises

  • Compare the Maritimes to the rest of Canada on one or more variable
  • Between 2005 and 2007, which provinces received more than $100 M in Total grants?
  • Which provinces, other than Nova Scotia and Ontario, had more than $20 M in Endowmennt income in 2003?
  • Which province was in “third place” in total revenue in 2012?

purrr (quickly)

revenue %>% 
  filter(GEO != 'Canada') %>% 
  group_by(Ref_Date) %>% 
  summarise(revenue_quantiles = list(quantile(`Total revenues`, c(0.25, 0.5, 0.75)))) %>% 
  mutate(
    low_25 = map_dbl(revenue_quantiles, "25%"),
    mid = map_dbl(revenue_quantiles, '50%'),
    high_75 = map_dbl(revenue_quantiles, '75%')
  )
## # A tibble: 16 x 5
##    Ref_Date revenue_quantiles   low_25       mid high_75
##       <int>            <list>    <dbl>     <dbl>   <dbl>
##  1     2000         <dbl [3]> 424965.0  681741.5 1944551
##  2     2001         <dbl [3]> 417664.5  700424.0 2126550
##  3     2002         <dbl [3]> 465112.5  729434.0 2378510
##  4     2003         <dbl [3]> 480403.8  804636.5 2708008
##  5     2004         <dbl [3]> 530995.2  853131.5 2818174
##  6     2005         <dbl [3]> 560023.2  911819.0 3257599
##  7     2006         <dbl [3]> 616411.5  964536.5 3320636
##  8     2007         <dbl [3]> 592926.8  944042.5 3560364
##  9     2008         <dbl [3]> 568460.8  968932.5 3404852
## 10     2009         <dbl [3]> 741015.5 1161066.0 4363439
## 11     2010         <dbl [3]> 787081.5 1213456.0 4436956
## 12     2011         <dbl [3]> 775058.8 1169906.5 4177081
## 13     2012         <dbl [3]> 799796.8 1240542.0 4296184
## 14     2013         <dbl [3]> 791843.2 1298845.0 4437751
## 15     2014         <dbl [3]> 799297.2 1316785.0 4623504
## 16     2015         <dbl [3]> 770570.8 1261935.5 4451983

Vectors (and other types)

Missing data

NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA

What about this?

NA / 2

Filtering and NAs

x <- NA
is.na(x)
## [1] TRUE

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

test_data <- tibble(x = c(1, NA, 3))
filter(test_data, x > 1)
## # A tibble: 1 x 1
##       x
##   <dbl>
## 1     3
filter(test_data, is.na(x) | x > 1)
## # A tibble: 2 x 1
##       x
##   <dbl>
## 1    NA
## 2     3

Dealing with vectors…

c(1, 2, 3)
## [1] 1 2 3
1:3
## [1] 1 2 3

Dealing with vectors…

1:3 + 1:9
## [1]  2  4  6  5  7  9  8 10 12

What about…

1:3 + 1:10

Dealing with vectors…

1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
## object length
##  [1]  2  4  6  5  7  9  8 10 12 11

What happened?

Missing values

mean(c(1,2,3))
## [1] 2

What do we expect here?

mean(c(1,NA,3))

Missing values

mean(c(1,NA,3))
## [1] NA
mean(c(1,NA,3), na.rm = TRUE)
## [1] 2

Vectors

Vectors

  • Coercion (explicit and implicit)
    • Type hierarchy
  • Atomic vectors (homogenous) vs lists (can nest/heterogeneous)
  • Names
  • Subsetting

Vectors: Naming

c(x = 1, y = 2, z = 4)
## x y z 
## 1 2 4
set_names(1:3, c("a", "b", "c"))
## a b c 
## 1 2 3

Vectors: Subsetting

# positive values
x <- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]
## [1] "three" "two"   "five"
# negative values
x[c(-1, -3, -5)]
## [1] "two"  "four"
#> [1] "two"  "four"

# subset a named vector w/character vector
x <- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
## xyz def 
##   5   2

Vectors: Logical subsetting

x <- c(10, 3, NA, 5, 8, 1, NA)

# All non-missing values of x
x[!is.na(x)]
## [1] 10  3  5  8  1
#> [1] 10  3  5  8  1

# All even (or missing!) values of x
x[x %% 2 == 0]
## [1] 10 NA  8 NA
#> [1] 10 NA  8 NA

Lists

x <- list(1, 2, 3)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
str(x) # str...ucture
## List of 3
##  $ : num 1
##  $ : num 2
##  $ : num 3

Can contain different types of objects:

y <- list("a", 1L, 1.5, TRUE)
str(y)
## List of 4
##  $ : chr "a"
##  $ : int 1
##  $ : num 1.5
##  $ : logi TRUE

Or other lists:

z <- list(list(1, 2), list(3, 4))
str(z)
## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ :List of 2
##   ..$ : num 3
##   ..$ : num 4

Additional Resources

Online Resources

Start here:

Additional:

For help/community:

References